Semantic Grouping Network for Video Captioning
Authors
Abstract
This paper considers a video caption generating network referred to as Semantic Grouping Network (SGN) that attempts (1) to group video frames with discriminating word phrases of the partially decoded caption and then (2) to decode those semantically aligned groups in predicting the next word. As consecutive frames are not likely to provide unique information, prior methods have focused on discarding or merging repetitive information based only on the input video. The SGN learns an algorithm to capture the most discriminating word phrases of the partially decoded caption and a mapping that associates each phrase to the relevant video frames; establishing this mapping allows semantically related frames to be clustered, which reduces redundancy. In contrast to prior methods, the continuous feedback from decoded words enables the SGN to dynamically update the video representation so that it adapts to the partially decoded caption. Furthermore, a contrastive attention loss is proposed to facilitate accurate alignment between a word phrase and video frames without manual annotations. The SGN achieves state-of-the-art performance, outperforming the runner-up methods by margins of 2.1%p and 2.4%p in CIDEr-D score on the MSVD and MSR-VTT datasets, respectively. Extensive experiments demonstrate the effectiveness and interpretability of the SGN.
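The two mechanisms the abstract describes, attention-based grouping of frames under decoded word phrases and a contrastive attention loss against a mismatched video, can be sketched roughly as follows. This is a minimal illustration under assumed shapes and a margin-ranking form of the loss; the function names, dimensions, and the exact loss formulation are assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def semantic_grouping(phrase_emb, frame_feats):
    """Softly group video frames under each decoded word phrase.

    phrase_emb:  (P, d) embeddings of partially decoded word phrases
    frame_feats: (T, d) per-frame video features
    Returns (P, d) phrase-aligned grouped video representations.
    """
    scores = phrase_emb @ frame_feats.T / np.sqrt(frame_feats.shape[1])
    attn = softmax(scores, axis=-1)      # each phrase attends over frames
    return attn @ frame_feats            # attention-weighted frame groups

def contrastive_attention_loss(phrase_emb, pos_frames, neg_frames, margin=1.0):
    """Hypothetical stand-in for the paper's loss: the peak phrase-frame
    attention on the matching video should exceed that on a mismatched
    (negative) video by a margin."""
    pos = softmax(phrase_emb @ pos_frames.T, axis=-1).max(axis=-1)
    neg = softmax(phrase_emb @ neg_frames.T, axis=-1).max(axis=-1)
    return np.maximum(0.0, margin - pos + neg).mean()

# Tiny demo with random features (3 phrases, 8 frames, 16 dims).
rng = np.random.default_rng(0)
phrases = rng.standard_normal((3, 16))
frames = rng.standard_normal((8, 16))      # frames of the matching video
other = rng.standard_normal((8, 16))       # frames of a different video
groups = semantic_grouping(phrases, frames)
loss = contrastive_attention_loss(phrases, frames, other)
print(groups.shape, loss >= 0.0)
```

In training, minimizing such a loss pushes each phrase's attention to concentrate on frames of its own video, which is what lets related frames cluster without manual phrase-frame annotations.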
Similar Resources
Reconstruction Network for Video Captioning
In this paper, the problem of describing visual contents of a video sequence with natural language is addressed. Unlike previous video captioning work mainly exploiting the cues of video contents to make a language description, we propose a reconstruction network (RecNet) with a novel encoder-decoder-reconstructor architecture, which leverages both the forward (video to sentence) and backward (...
Automatic Video Captioning using Deep Neural Network
Video understanding has become increasingly important as surveillance, social, and informational videos weave themselves into our everyday lives. Video captioning offers a simple way to summarize, index, and search the data. Most video captioning models utilize a video encoder and captioning decoder framework. Hierarchical encoders can abstractly capture clip level temporal features to represen...
Consensus-based Sequence Training for Video Captioning
Captioning models are typically trained using the crossentropy loss. However, their performance is evaluated on other metrics designed to better correlate with human assessments. Recently, it has been shown that reinforcement learning (RL) can directly optimize these metrics in tasks such as captioning. However, this is computationally costly and requires specifying a baseline reward at each st...
Deep Learning for Video Classification and Captioning
Accelerated by the tremendous increase in Internet bandwidth and storage space, video data has been generated, published and spread explosively, becoming an indispensable part of today's big data. In this paper, we focus on reviewing two lines of research aiming to stimulate the comprehension of videos with deep learning: video classification and video captioning. While video classification con...
Multimodal Memory Modelling for Video Captioning
Video captioning which automatically translates video clips into natural language sentences is a very important task in computer vision. By virtue of recent deep learning technologies, e.g., convolutional neural networks (CNNs) and recurrent neural networks (RNNs), video captioning has made great progress. However, learning an effective mapping from visual sequence space to language space is st...
Journal
Journal title: Proceedings of the ... AAAI Conference on Artificial Intelligence
Year: 2021
ISSN: 2159-5399, 2374-3468
DOI: https://doi.org/10.1609/aaai.v35i3.16353